Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VA: Fix performRemoteValidation goroutine leak #7727

Merged
merged 5 commits into from
Sep 30, 2024

Conversation

beautifulentropy
Copy link
Member

@beautifulentropy beautifulentropy commented Sep 27, 2024

PerformValidation goroutines write to a buffered results channel to prevent blocking.

@beautifulentropy beautifulentropy marked this pull request as ready for review September 27, 2024 16:25
@beautifulentropy beautifulentropy requested a review from a team as a code owner September 27, 2024 16:25
@pavelnikolov
Copy link

No unit tests need to be changed?

Copy link
Contributor

@aarongable aarongable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose that the parent gRPC context is cancelled.

Today, the resulting sequence of events is:

  1. the child rva.PerformValidation gRPC detect the cancellation and quit, writing their cancellation errors to the results channel
  2. the for result := range results loop reads errors from the results channel until the errors surpass the threshold
  3. it increments the remoteValidationFailures metric, modifies the first error to indicate it occurred during secondary validation, and returns
  4. the remaining goroutines hang trying to write to the results channel

With this change, the resulting sequence of events becomes racy and non-deterministic, depending on what order the go runtime schedules selects. It might behave like the above (with the exception of leaking the child goroutines), or:

  1. the for/select loop detects the cancellation and immediately quits, returning a different kind of error describing the cancellation and not incrementing the metric

I don't love this non-determinism.

I think a cleaner solution might be simply to make the results channel a buffered channel with capacity len(va.remoteVAs). We know that number is always going to be small -- likely never more than 6, even with the new BRs MPIC requirements -- and the items in it are small because they're all pointers.

This way all the rest of the logic can remain untouched: the child goroutines are guaranteed to be able to write to the results channel and exit, and the loop reading from the results channel can continue to early-return as soon as it gets enough results without having to worry about checking for cancellation itself.

That said, I think creating a new context for the child gRPCs and cancelling it when this function returns is still a good idea, as it will cause them to bail out faster.

va/va.go Outdated Show resolved Hide resolved
va/va.go Outdated Show resolved Hide resolved
@beautifulentropy beautifulentropy force-pushed the fix-perform-validation-goroutine-leak branch from 44129e7 to faf4324 Compare September 30, 2024 16:19
@beautifulentropy beautifulentropy merged commit ab69b72 into main Sep 30, 2024
12 checks passed
@beautifulentropy beautifulentropy deleted the fix-perform-validation-goroutine-leak branch September 30, 2024 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants